TECHNICAL MEMORANDUM


HARDWARE MATH FOR THE 6502 MICROPROCESSOR 
INTRODUCTION 


Floating-point arithmetic is generally a time-consuming task, especially on an 8-bit microprocessor.
The system described here is the result of a growing arithmetic workload in a real-time control system
built around a 6502 microprocessor. Four AMD 9511A's (Intel 8231A) were used in parallel and
connected directly to the 6502. The 6502 (in the Rockwell AIM 65) was clocked at 2 MHz and the 9511's
at 4 MHz.

Originally, the AIM 65 was to read data from six gyros and six accelerometers (two complete 
inertial navigation systems) and from two resolvers, then send these data to a central computer once 
each 20 ms. Then, more computation became necessary as strapdown algorithms, control algorithms, and 
finally, everything except mass storage was added to the software. The system described here does all 
this arithmetic in the required time and can be adapted to any 6502 system to give an arithmetic speedup 
of about 100 times over BASIC. 


HARDWARE 


Figure 1 shows the timing diagrams for a 6502 at 2 MHz and a 9511 at 4 MHz. The 9511 is
asynchronous, which makes interfacing easier. The biggest hardware problem with the 6502 is that it
cannot be stopped during a write. The problem with timing is to obtain the 25 ns minimum between
WR going high and CS, A0 going high.

Figure 2 shows the one-9511 interface for a typical 6502 system and Figure 3 shows the full
four-9511 interface as described in this report.

Reference 1 gives details on interfacing a 9511 to a 6502 by using a 6522 VIA interface chip.
This method is straightforward to implement, but has so much overhead that it is primarily useful only
for trigonometric functions.

Reference 2 gives details of another method for directly connecting the 9511 to a 6502 (OSI
Superboard II). The method was optimized to the OSI board, but could probably have been adapted
as needed had it been discovered soon enough.

DMA is generally the fastest method that could be used, but was not worked out for the 6502.
It would be quite complex and may not actually be any faster than the method used here.

The actual hardware connections between the 6502 bus and the 9511's went through a
Computerist, Inc., DRAM board which was already part of this AIM 65 system. Chip select (CS) was also
done by the DRAM board, although this was only for convenience. Another, more direct method to
generate the control signals could be used so long as the same timing is kept.


Interfaces for other microprocessors (using one 9511) are generally simpler than that for the 
6502 and are already published by AMD. 


SOFTWARE 


In the four-APU system all software was handwritten. There is about 5K object code and 3K 
BASIC. Reference 2 used a compiler, but none was written here. More than one APU implies overlap 
and critical timing considerations to get maximum speed and efficiency, so a compiler did not seem 
feasible. 

A simulation was written for the desktop HP 9845A, which plotted timelines for the CPU and
each of the four APU's. Busy and idle times are all clearly shown, so code could be inserted if needed.
Any APU overlap (sending to a busy APU, etc.) was also flagged (see Appendix).

Another simulation of the actual arithmetic was written for the HP 9845A so numerical
correctness could be checked. This simulation did the same calculations, in the same order, as the AIM 65
system itself. Still another simulation (done by a contractor as part of an overall system study) was
made to do the original theoretical algorithms. Eventually, all results agreed and the timeline indicated a
worst-case time less than 20 ms.

Hardware is arranged so that each APU, or any combination of APU's, has a unique address.
Data are sent by doing an LDA data then STA address, which takes 4 µs total. One floating-point number
for an APU contains 4 bytes, so the minimum time to load is 16 µs. A command will usually require
only 3 µs to send (LDA immediate, STA address). A data read requires 4 µs.
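
The transfer times quoted above follow from the standard 6502 cycle counts (LDA absolute 4, LDA immediate 2, STA absolute 4) at a 2 MHz clock. The sketch below only redoes that arithmetic; the names `CYCLES`, `CPU_MHZ`, and `microseconds` are illustrative and do not come from the original software.

```python
# Standard NMOS 6502 cycle counts for the addressing modes used here.
CYCLES = {"LDA_abs": 4, "LDA_imm": 2, "STA_abs": 4}
CPU_MHZ = 2.0  # AIM 65 clock rate stated in this memorandum


def microseconds(*ops):
    """Total execution time in microseconds for a sequence of instructions."""
    return sum(CYCLES[op] for op in ops) / CPU_MHZ


byte_write = microseconds("LDA_abs", "STA_abs")  # one data byte to an APU
command = microseconds("LDA_imm", "STA_abs")     # one command byte
operand = 4 * byte_write                         # one 4-byte floating-point number
```

With these counts, `byte_write`, `command`, and `operand` come out to 4, 3, and 16 µs, matching the figures in the text.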

Two methods were used to determine when an APU is finished. One is to check the busy bit
in the status register of a particular APU. The other is reading a special address which contains the four
APU END lines as the lower four bits and zeros in the upper four. This extra hardware was added to
improve CPU response time and reduce software volume, since more than one APU status check would
normally require a separate read and check for each overlapped APU. Waiting for all END lines to go low
requires only one read of this added register.
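
As a rough illustration of why the combined END register saves code, the sketch below tests all four APU's with one mask. The bit-to-APU assignment (bit 0 for APU 1, and so on) and the function names are assumptions made for the example; the memorandum states only that the four END lines occupy the lower four bits and that low means finished.

```python
END_MASK = 0x0F  # lower four bits carry the END lines; upper four read as zero


def busy_apus(end_register):
    """Return the 1-based numbers of APU's whose END line is still high.

    Assumes bit 0 belongs to APU 1, bit 1 to APU 2, etc. (hypothetical ordering).
    """
    return [n + 1 for n in range(4) if end_register & (1 << n)]


def all_done(end_register, wanted=END_MASK):
    """True when every END line selected by `wanted` has gone low."""
    return (end_register & wanted) == 0
```

On the 6502 the same test is a single load of the register followed by one branch on the zero flag, instead of one status read and branch per APU.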

Sometimes no check of APU status was made at all, since it was known that the APU could not
be busy. Interrupts were not used since response time is much too slow (at least 21 µs at 2 MHz), and
I/O also uses interrupts (conflict).

The hardware was designed such that the CPU will halt when an APU read is attempted and that
APU is busy. It will remain stopped until the APU is no longer busy. On a write, however, no attempt
is made to stop the CPU since it won't stop during a write. It will stop only on the next non-write
cycle. This causes a continual problem and requires implementing a check to be sure the APU is
not busy before writing to it. Figure 4 is an example of how an APU can be used.

There is another 6502 peculiarity that has to be kept in mind when the memory map is being
laid out. It only applies when indexed instructions are used across a page boundary and reflects the
6502 design rules. During one microcycle the address of the incorrect preceding page is actually put on
the bus and could activate whatever was there. This is always a read, but an APU could still be triggered
if it happened to have that address.
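
The page-crossing behavior can be checked when laying out the memory map. The sketch below models the well-known NMOS 6502 behavior for indexed reads: the low address byte is summed first, and the high byte is corrected one cycle later, so the uncorrected address is read once. The function name and the example table address are hypothetical; $9010 is the APU data address used in Figure 4.

```python
def indexed_read_addresses(base, index):
    """Addresses an NMOS 6502 puts on the bus for an absolute-indexed read.

    Returns (spurious, final); `spurious` is None when no page is crossed.
    """
    final = (base + index) & 0xFFFF
    # Low byte already summed, high byte not yet incremented.
    uncorrected = (base & 0xFF00) | (final & 0x00FF)
    spurious = uncorrected if uncorrected != final else None
    return spurious, final


# A table at $90F8 read with index $18 ends at $9110 but first touches $9010.
spurious, final = indexed_read_addresses(0x90F8, 0x18)
```

Here the extra read lands on $9010, which would pop a byte from the APU stack if an APU data register were mapped there.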
The APU's have one floating-point format; AIM 65 BASIC memory has another. Since BASIC
was used as the controlling language, conversion in both directions had to be made. All constants and
one-time calculations were done in BASIC and converted to APU format (Fig. 5). When results from an
APU are displayed by BASIC they must be converted to BASIC format (Fig. 6). The APU format is:
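
      Byte 0                      Byte 1      Byte 2    Byte 3
     +----+----+-------------+   +---+------+---------+---------+
     | MS | ES |             |   | 1 |  M3  |   M2    |   M1    |
     +----+----+-------------+   +---+------+---------+---------+
          |<--- EXPONENT --->|    23                            0
                                 |<---------- MANTISSA -------->|

Exponent is unbiased 2's complement, 7 bits (-64 to +63).
Bit 23 = 1 except for zero, which is all zeros.
Mantissa sign (MS): 0 = +, 1 = -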

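and the BASIC format is

      Byte 0              Byte 1        Byte 2    Byte 3    Byte 4
     +----+----------+   +----+------+---------+---------+---------+
     | ES |    E     |   | MS |  M3  |   M2    |   M1    |   M0    |
     +----+----------+   +----+------+---------+---------+---------+
     |<-- EXPONENT ->|

Exponent: $81 = +1; $80 = 0; $7F = -1, etc. (biased).
Bit 7 of M3 is the mantissa sign (MS): 0 = +, 1 = -.
Zero is represented by all five bytes = 0.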
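
The two formats and the conversions of Figures 5 and 6 can be modeled in a few lines. This is a sketch under stated assumptions, not the original code: it assumes the APU mantissa is a 24-bit fraction with the binary point to the left of bit 23, follows the clamping and truncation visible in the Figure 5 listing, and uses hypothetical function names.

```python
def apu_to_float(b):
    """Decode four APU-format bytes (sign/exponent byte first) to a float."""
    if b[1] & 0x80 == 0:          # bit 23 is clear only for zero
        return 0.0
    sign = -1.0 if b[0] & 0x80 else 1.0
    exp = b[0] & 0x7F
    if exp >= 0x40:               # 7-bit two's complement
        exp -= 0x80
    mantissa = (b[1] << 16) | (b[2] << 8) | b[3]
    return sign * (mantissa / 2**24) * 2.0**exp


def basic_to_apu(v):
    """Convert five BASIC-format bytes (E, M3, M2, M1, M0) to four APU bytes."""
    e, m3, m2, m1, _m0 = v        # M0 is dropped (truncate 4 mantissa bytes to 3)
    if e == 0:
        return bytes(4)
    sign = m3 & 0x80
    if e >= 0xC0:                 # exponent above +63: clamp, mantissa all ones
        return bytes([sign | 0x3F, 0xFF, 0xFF, 0xFF])
    if e < 0x40:                  # exponent below -64: clamp, mantissa all ones
        return bytes([sign | 0x40, 0xFF, 0xFF, 0xFF])
    return bytes([sign | (e & 0x7F), m3 | 0x80, m2, m1])


def apu_to_basic(b):
    """Convert four APU bytes back to five BASIC-format bytes."""
    if b[1] == 0:
        return bytes(5)
    m3 = b[1] if b[0] & 0x80 else b[1] & 0x7F   # bit 7 of M3 becomes the sign
    e = b[0] & 0x7F
    if e < 0x40:                  # non-negative exponent: set the bias bit
        e |= 0x80
    return bytes([e, m3, b[2], b[3], 0])
```

Because the BASIC bias of $80 is a multiple of 128, masking the exponent byte to seven bits yields the APU's two's-complement exponent directly, which is why the assembly needs only shifts and a rotate.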


Instruction sequences were always arranged to make maximum use of otherwise idle CPU time and
even APU time when one is free. Registers can be loaded and other calculations and operations inserted
to eliminate the idle time. The timeline simulation was used to do this efficiently.

Minimum software and maximum speed required a tradeoff of in-line versus subroutine code.
In-line is considerably faster (16 µs versus 22 µs + overhead), but subroutines require much less memory
(EPROM eventually), so subroutines were generally used since enough time was thought to be available
(it just was). There were also many tradeoffs on types of assembly instructions to use to keep the
number of subroutines to a minimum, but still allow fast execution. Absolute indexed mode was
generally used, since some versatility is available without the speed penalty of indirect instructions. For a
particular subroutine, the assembler can adjust the absolute part of the address, so the index does not
need to be changed at run time.

Error checks were not made in this application. It seemed that too much time would be wasted
looking for the only two errors that could occur: overflow and underflow. Underflow should not be
ignored; it must be worked around. Underflow does not result in zero but, instead, in a change
of exponent sign. This could be disastrous. So scaling must be done to prevent it, it must be checked,
or safety checks must be made in case it occurs. Underflow occurs at about ±2.7 x 10^-20 and overflow
at about ±9.2 x 10^18. Overflow means there's a hardware problem, which will immediately appear
elsewhere in the results.
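
The limits and the sign-change hazard both follow from the 7-bit exponent field. The sketch below assumes the mantissa is a fraction between 0.5 and 1 and that an out-of-range exponent simply wraps modulo 128; the memorandum says only that the exponent sign changes, so the exact wrap is an assumption, and `wrap7` is a hypothetical helper.

```python
def wrap7(exp):
    """Reduce an exponent to the 7-bit two's-complement range -64..+63."""
    return ((exp + 64) % 128) - 64


smallest = 0.5 * 2.0**-64               # mantissa 0.5, exponent -64
largest = (1 - 2.0**-24) * 2.0**63      # mantissa all ones, exponent +63

underflowed = wrap7(-65)                # one step below the minimum exponent
```

Under this model a result that should carry exponent -65 is reported with exponent +63, so a value near 10^-20 would be read back as one near 10^18 rather than as zero.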

Software reset of the APU’s was not implemented. The AIM 65 has this as a manual button, if 
it should ever be required. If it is required, it means there is a noise problem or a hardware problem that 
must be fixed. 
PERFORMANCE


The quad system in this application is running at 0.026 MFLOP. This includes CPU overhead, 
APU idle time, APU stack manipulations, etc. An 8086-8087 at 5 MHz would do about 0.021 MFLOP. 

A 6 MHz 68000-16081 combination should be around 0.050 MFLOP and an HP9000 using compiled 
BASIC without floating-point hardware about 0.075 MFLOP. 

Overhead generally doubles the time it takes to do any particular operation. Data must be moved 
into and out of the APU, stack changes must be made to obtain higher efficiency, and the CPU must 
sometimes wait for an APU to finish. 

More APU's could be tied to one CPU (2 MHz), but four is about the limit for simple arithmetic.
If trigonometric functions were the primary requirement, then up to 10 APU's could be kept busy.

FORTH would be much better than BASIC for most applications where the maximum speed is
not required. Pure assembly code squeezes an extra factor of 5 to 10 out of the hardware. With
FORTH, little or no assembly code would be needed.

Using all in-line code (macros) could increase the overall speed by maybe 10 percent, but at
a cost of three times the original program memory (5K to 15K here).


SUMMARY 


The quad-9511 system described here works rather well, but is time consuming to program and
debug. A typical application would only have one APU, which would be much easier to use. One APU
would also make writing a compiler a feasible and useful task.

Overlapped APU operation makes program changes something to be done with great care. Also, 
the carry flag and the X-register sometimes are expected to retain their value through several subroutines. 
The CPU is often doing non-APU operations while some APU’s are still busy. Overflow and underflow 
may be a problem and must be considered, particularly in a real-time system. The LSS unit continues to 
operate as expected. 
9511A DIRECT ON BUS

Figure 1. Timing diagrams.

RULES

1) CHIP ENABLE CS AND A0 MUST BE STABLE BEFORE RD OR WR GO LOW.

2) RD MUST RETURN HIGH [illegible] NS BEFORE CS GOES HIGH OR A0 CHANGES.

3) WR MUST RETURN HIGH AT LEAST 25 NS BEFORE CS GOES HIGH OR A0 CHANGES.

4) ON READ CYCLE NOT READY MUST BE SET LOW BEFORE Φ2 GOES HIGH.

CLK: THE CLOCK FREQUENCY FOR THE 8231A IS 4 MHZ.

Φ0: THE AIM 65 CHIP CLOCK IS 2 MHZ OR LESS.

ADDRESS LINES (INCLUDES VS1) SHOULD BE BUFFERED.

Figure 2. Single-unit interface.

Figure 3. Quad-unit interface.








































NUM1=74
NUM2=61
APUD=$9010                      (DATA)
APUC=$9011                      (CMD)
DATAO=$4700                     (RESULT)
        *=$4800
        LDA #NUM1
        STA APUD
        LDA #0
        STA APUD
        STA APUD
        STA APUD                (FIRST #)
        LDA #$1C
        STA APUC                (FLOAT)
A1      BIT APUC
        BMI A1                  (WAIT)
        LDA #NUM2
        STA APUD
        LDA #0
        STA APUD
        STA APUD
        STA APUD                (SECOND #)
        LDA #$1C
        STA APUC                (FLOAT)
A2      BIT APUC
        BMI A2                  (WAIT)
        LDA #$12
        STA APUC                (FMUL)
A3      BIT APUC
        BMI A3                  (WAIT)
        LDA #$1E
        STA APUC                (FIXED)
A4      BIT APUC
        BMI A4                  (WAIT)
        LDA APUD
        STA DATAO
        LDA APUD
        STA DATAO+1
        LDA APUD
        STA DATAO+2
        LDA APUD
        STA DATAO+3             (RESULT)
        BRK
        .END


Figure 4. Single-unit multiply example.
                                USES 42 USEC
                                (BADR)  BASIC VAR LOC (5 BYTES)
                                (BUF82) 8231 BUFFER (4 BYTES)

        [listing head missing]
        BNE                     CHECK IF EXP=0
            #3                  AND IF SO SET
            #0                  ALL 8231=0
T03     STA (BUF82),Y
        DEY
        BNE T03
        RTS
T01     CMP #$C0                <C0 --> NO OVF
        BCC T04                 CLEAR --> <
        LDA #$3F                OVF
        BNE T05
T04     CMP #$40                >=40 --> NO UF
        BCS T06                 SET --> >=
        LDA #$40                UF
T05     ASL A                   LEFT ONE SO CAN ROR
        TAX                     NEED ACC
        INY                     Y=1
        LDA (BADR),Y
        ASL A                   PUT M. SIGN INTO CARRY
        TXA                     GET OVF OR UF
        ROR A                   APPEND CARRY TO IT
        DEY                     Y=0
        STA (BUF82),Y
        LDA #$FF                ALL ONES IN
        LDY #3                  MANTISSA
T02     STA (BUF82),Y           (INCL BIT 23)
        DEY
        BNE T02
        RTS
T06     ASL A                   LINE UP FOR LATER ROR
        TAX                     NEED ACC SAVED
        INY                     Y=1
        LDA (BADR),Y
        CMP #$80                SET CARRY IF BIT 7 SET
        ORA #$80                BIT 7 TO 1
        STA (BUF82),Y
        TXA                     RECALL DATA
        ROR A                   APPEND CARRY
        DEY                     Y=0
        STA (BUF82),Y
        LDY #2                  DIRECT TRANSFER OF
        LDA (BADR),Y            THESE BYTES
        STA (BUF82),Y           (TRUNCATE 4 TO 3)
        INY
        LDA (BADR),Y
        STA (BUF82),Y
        RTS


Figure 5. AIM-to-8231 format conversion.


        *=$3100                 USES 54 USEC
TOBA    LDY #4
        LDA #0
        STA (BADR),Y            ALWAYS ZERO
        DEY                     Y=3
BA1     LDA (BUF82),Y
        STA (BADR),Y
        DEY                     3,2,1 BUT NOT Y=0
        BNE BA1
        AND #$FF                CHECK IF=0 ALWAYS (SET FLAGS)
        BNE BA2
        STA (BADR),Y            BYTE ZERO ALSO=0
        RTS
BA2     LDA (BUF82),Y           NOT=0
        BMI BA3                 SKIP IF M.S. OK
        INY                     Y=1
        LDA (BUF82),Y
        AND #$7F                M.S.=0
        STA (BADR),Y
        DEY                     Y=0
BA3     LDA (BUF82),Y
        AND #$7F                CLEAR M.S. BIT (EXP NOW)
        CMP #$40                SET CARRY IF >= $40
        BCS BA4
        ORA #$80                SET E.S.
BA4     STA (BADR),Y
        RTS


Figure 6. 8231-to-AIM format conversion.


APPENDIX 


TIMELINE SIMULATION 


The entire simulation is somewhat lengthy, so only the subroutine that actually calculates the 
time intervals will be discussed. It can then be used as needed in large systems. All other subroutines 
eventually call this one. It calculates and plots CPU time plus the four APU times, idle as well as busy. 

It checks for illegal overlap, namely, writing to an already busy APU. Preventing such overlap is the 
software designer’s responsibility since the hardware cannot prevent it. 

There are six subroutine parameters. T1 is the total CPU time including any subroutine call
overhead. The assumed normal situation is that a subroutine is being simulated. If not, then T1 is made
negative and the call overhead time is not included. The user is expected to include the subroutine
return overhead time whenever it occurs since this is CPU time. Subroutine call overhead is automatically
added unless T1 is negative. The return overhead is not automatically handled because it often is
separated from the first part of its subroutine and because it is easier to handle than the call overhead.

Dt is the delta time (CPU) until an APU is first operated. The subroutine call overhead, if any,
is included in Dt. Both T1 and Dt are measured from the time before a subroutine call was initiated,
if any.


Each P value (P1 through P4) is the time that the corresponding APU is committed to a task, either
executing an instruction, or loading or unloading a command or data. A minus sign means the CPU will
wait until that APU is ready.

If Grf = 1, then the full graph will be drawn. If Grf = 0, only execution times will be available
to be printed (by external routines). The graph shows busy time for the CPU and each of the four
APU's. Available CPU and APU times are immediately visible. The final time is always available and it
can be worst-case (typically) or otherwise, depending on needs. Other items such as percent utilization,
efficiency, etc., of various schemes could be added.

Any sequence of machine code can be simulated with this subroutine. Pure CPU time can be 
done as well as all overlapped APU’s. Routines may have to be split in various ways to make them fit. 
Variable times will need to be fixed, with, usually, either the minimum or maximum value. 
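
For readers without an HP 9845, the arithmetic part of the subroutine in Figure A-1 (statements 1110 through 1350, graphics omitted) can be approximated as below. This is a hedged port based on one reading of a poorly reproduced listing: the 3 µs call overhead and the 3.5 µs wait granularity are taken from the listing as read, while the function name `step`, the `state` dictionary, and the returned overlap list are inventions for this example.

```python
import math


def step(state, t1, dt, p):
    """Advance the timeline by one simulated code segment.

    state: {"C": cpu_time, "start": [4 values], "end": [4 values]} in microseconds
    t1:    total CPU time of the segment (negative = no call overhead)
    dt:    CPU time until the APU's are first operated
    p:     four committed times, one per APU (negative = CPU waits for that APU)
    Returns the 1-based numbers of APU's written to while still busy.
    """
    call = 3 * (t1 >= 0)                      # subroutine call overhead
    cc = state["C"] + call
    dc = dt - call
    # Longest remaining busy time among the APU's the CPU must wait for.
    wait = max([0] + [state["end"][i] - cc for i in range(4) if p[i] < 0])
    wait = math.floor(math.floor((wait + 1) / 3.5) * 3.5)
    mm = cc + dc + wait                       # moment the APU's are operated
    overlaps = [i + 1 for i in range(4)
                if p[i] > 0 and state["end"][i] >= mm]
    for i in range(4):
        if p[i] in (0, -1):                   # APU not committed by this segment
            continue
        if state["end"][i] <= state["C"]:     # APU was idle: new busy interval
            state["start"][i] = mm
        state["end"][i] = mm + abs(p[i])
    state["C"] += abs(t1) + wait
    return overlaps


state = {"C": 0, "start": [0, 0, 0, 0], "end": [0, 0, 0, 0]}
first = step(state, 20, 10, [30, 0, 0, 0])    # start APU 1 at t = 10 for 30 us
second = step(state, 10, 5, [30, 0, 0, 0])    # write to APU 1 again at t = 25
```

In this usage the second call reports APU 1 as overlapped, because it is written at 25 µs while still busy until 40 µs; passing -30 instead would make the simulated CPU wait.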

Figure A-l is the subroutine listing as written in HP 9845 BASIC. It assumes a 2 MHz 6502 and 
4 MHz APU’s. The calculation portion can easily be adapted to another machine, but the graphics may 
be more involved. 

Figure A-2 shows the results when this subroutine is applied to the LSS project. Changes are 
easy to make and their effects easy to see. With so much overlap taking place, making a change often 
produces an unexpected result. Times used here are in microseconds and there are no fractional values 
permitted. Calculated final time is 19.5 ms worst case, and measured times were in the 18 ms range. 

Earlier results indicated over 20 ms (22.1 ms), and various otherwise idle times were put to use in
achieving the reduction. Further reduction would be quite difficult. A reduction of only about 10 percent,
as obtained here, is probably not typical, since the original code was already tightly written.

1090 SUB C(T1,Dt,P1,P2,P3,P4)
1100 COM X1,X2,Y1,Y2,Z1,Z2,E1,E2,V1,C,Xx,Yy,Fadd,Fmul,Nop,Fsub,Flts,Ptof,Popf,Xchf,Chsf,Fixs,Grf
1110 Cc=C+3*(T1>=0)
1120 Dc=Dt-3*(T1>=0)
1130 M=MAX((X2-Cc)*(P1<0),(Y2-Cc)*(P2<0),(Z2-Cc)*(P3<0),(E2-Cc)*(P4<0),0)
1140 M=INT(INT((M+1)/3.5)*3.5)
1150 Mm=Cc+Dc+M
1160 Cx=Mm
1170 IF (P1>0) AND (X2>=Cx) THEN PRINT "APU1 OVERLAP";C;T1;Dt;X2-Mm
1180 IF (P2>0) AND (Y2>=Cx) THEN PRINT "APU2 OVERLAP";C;T1;Dt;Y2-Mm
1190 IF (P3>0) AND (Z2>=Cx) THEN PRINT "APU3 OVERLAP";C;T1;Dt;Z2-Mm
1200 IF (P4>0) AND (E2>=Cx) THEN PRINT "APU4 OVERLAP";C;T1;Dt;E2-Mm
1210 T1=ABS(T1)+M
1220 IF (P1=0) OR (P1=-1) THEN 1250
1230 IF X2<=C THEN X1=Mm
1240 X2=Mm+ABS(P1)
1250 IF (P2=0) OR (P2=-1) THEN 1280
1260 IF Y2<=C THEN Y1=Mm
1270 Y2=Mm+ABS(P2)
1280 IF (P3=0) OR (P3=-1) THEN 1310
1290 IF Z2<=C THEN Z1=Mm
1300 Z2=Mm+ABS(P3)
1310 IF (P4=0) OR (P4=-1) THEN 1340
1320 IF E2<=C THEN E1=Mm
1330 E2=Mm+ABS(P4)
1340 REM PRINT C;T1;Dt;M;Mm;P1;P2;P3;P4;X1;X2;Y1;Y2;Z1;Z2;E1;E2
1350 IF Grf=0 THEN C=C+T1
1360 IF Grf=0 THEN 1640
1370 Ty=Yy*Xx/6
1380 FOR T=C TO C+T1-1
1390 IF (T<>INT(T/Ty)*Ty) OR (T=0) THEN 1440
1400 DUMP GRAPHICS
1410 V1=-5
1420 GCLEAR
1430 GRID 10,6,0,Yy,5,1,2
1440 Tt=T MOD Xx
1450 Ts=Tt+1
1460 IF (Tt=0) AND (T<>0) THEN V1=V1+6
1470 IF (T>=Cc) AND (T<Cc+M) THEN 1500
1480 MOVE Tt,V1
1490 DRAW Ts,V1
1500 IF (T<X1) OR (T>=X2) THEN 1530
1510 MOVE Tt,V1+1
1520 DRAW Ts,V1+1
1530 IF (T<Y1) OR (T>=Y2) THEN 1560
1540 MOVE Tt,V1+2
1550 DRAW Ts,V1+2
1560 IF (T<Z1) OR (T>=Z2) THEN 1590
1570 MOVE Tt,V1+3
1580 DRAW Ts,V1+3
1590 IF (T<E1) OR (T>=E2) THEN 1620
1600 MOVE Tt,V1+4
1610 DRAW Ts,V1+4
1620 NEXT T
1630 C=T
1640 SUBEXIT

Figure A-1. Timeline subroutine.
FINAL TIME 19455



Figure A-2. LSS timeline. 